Executive Summary


In this project, we develop a text analysis tool, employing a topic model methodology, for
distribution to military analysts. Analysts across the Department of Defense often
encounter data sets containing vast amounts of unstructured free text documents. The majority
of these analysts lack either the technical expertise or the system availability to employ code-based
natural language processing and topic modeling software tools. Additionally, many current
topic modeling methodologies function as a black box and do not allow analysts to employ their
domain expertise during the development of the topic model.

The requirement to deliver a text analysis tool that is easily deployable across the Department of
Defense computing environment limits the programming languages available for software
development. Since the Microsoft Office Suite, including Excel, is ubiquitous across the DoD
computing environment, the Visual Basic for Applications (VBA) programming language
presents the best option as our programming language of choice.

We propose an innovative topic modeling technique to overcome the limitations of the Visual
Basic for Applications language, while still delivering a software solution capable of supporting
the analysis of large datasets. Specifically, we propose a topic modeling methodology using
K-means clustering to estimate the posterior probabilities of the topic distributions within the
Latent Dirichlet Allocation family of topic models. This estimation method replaces the
collapsed Gibbs sampling estimation technique currently in use in most Latent Dirichlet
Allocation topic models. The K-means clustering estimation method produces results with
accuracy similar to or better than that of the collapsed Gibbs sampling method while significantly
reducing the computational time. We then incorporate K-means Latent Dirichlet Allocation
topic models into the Subject Matter Expert Refined Topic methodology to arrive at the K-means
Subject Matter Expert Refined Topic methodology.

We  also  develop  a  software  instantiation  of  the  K-means  Subject  Matter  Expert  Refined  Topic 
methodology  that  is  available  both  as  a  standalone  Excel  spreadsheet  and  as  an  Excel  add-in. 

This software allows analysts without a coding background, or without access to other computational
programming environments, to build topic models from free text datasets using a familiar
Excel-based graphical user interface.




Chapter  1.  Introduction 


Our  primary  objective  is  the  development  of  a  text  analysis  tool,  employing  a  topic  model 
methodology,  for  distribution  to  military  analysts.  This  technical  report  documents  the  key 
elements  of  the  K-means  Subject  Matter  Expert  Refined  Topic  (KSMERT)  methodology  and 
related  software  based  tool.  Chapter  1  describes  the  project  background,  use  cases,  and 
applications.  Chapter  2  covers  the  major  innovation  for  this  project,  which  is  the  development  of 
a  topic  model  estimation  method  using  K-means  clustering.  Chapter  3  describes  the 
development  of  a  software  instantiation  of  the  KSMERT  methodology  suitable  for  deployment 
across  the  Department  of  Defense  (DoD)  computing  environment.  Finally,  Chapter  4 
communicates  our  conclusions  and  recommendations  for  future  research. 

Background 

Analysts across the DoD often encounter data sets containing vast amounts of unstructured free
text documents. The majority of these analysts lack either the technical expertise or the system
availability to employ code-based natural language processing and topic modeling software tools.
Additionally, many current topic modeling methodologies function as a black box and do
not allow analysts to employ their domain expertise during the development of the topic model.
With  these  challenges  in  mind,  the  objective  of  this  project  is  the  development  of  a  text  analysis 
tool,  employing  a  topic  model  methodology,  which  is:  easily  deployable  across  the  majority  of 
DoD  computing  systems,  allows  analysts  to  incorporate  their  domain  knowledge  into  the  topic 
model  development,  and  does  not  require  analysts  to  have  a  specific  coding  background. 

We  build  on  the  foundational  work  by  Allen,  Xiong,  and  Afful-Dadzie  (2015)  that  proposes 
Subject  Matter  Expert  Refined  Topic  (SMERT)  models.  This  includes  a  method  to  incorporate 
SME  domain  specific  knowledge  as  an  intermediate  step  in  fitting  Latent  Dirichlet  Allocation 
(LDA) topic models. This type of refinement is often necessary so that the topics align with
pre-existing classifications and context-specific needs and have dispersion across the observation
space of interest.

Use  Cases  and  Applications 

We use the Network Integration Exercise (NIE) events run by the Brigade Modernization
Command (BMC) at Fort Bliss, Texas as the primary use case. The BMC, with analytic support
from TRADOC Analysis Center-White Sands Missile Range (TRAC-WSMR), is responsible
for assessing the viability and maturity of multiple systems during each NIE event. One of the
primary data collection methods for these events is direct observation reports, written by observer
controllers and observer analysts, that each contain multiple free text data fields. This use case
provides direct parallels to the larger problem scope in that analysts of varied technical acumen
must analyze a body of unstructured, free text data in an environment that restricts access to other
software-based text analytic tools. Opportunities to deploy developmental versions of the
methodology and software provide extensions to the base use case and further demonstrate the
need for a deployable topic modeling tool.

These use cases capture only a small subset of the potential applications for both the underlying
methodology and the deployable software that this research seeks to produce. Multiple communities
across the DoD face the challenge of rapidly analyzing large corpora of free text data. The
potential applications include human intelligence and signals intelligence reports in the
intelligence community, network and administrator logs in the cyber community, and free text
survey responses in the training and doctrine community.

Chapter 2. A Topic Model Estimation Method Based on K-Means Clustering


Overview 

Latent  Dirichlet  Allocation  is  a  clustering  method  widely  used  for  creating  interpretable  topics 
from  text  corpora.  SMERT  models  are  a  generalization  of  LDA,  invented  by  our  researchers, 
that  permits  SMEs  to  edit  and  improve  the  topic  definitions.  Unfortunately,  the  current  methods 
for  fitting  LDA  models,  collapsed  Gibbs  sampling  and  variational  inference  estimation,  lack 
repeatability  and  are  computationally  expensive. 

To address these limitations, we propose an innovative topic modeling technique that uses
K-means clustering to fit LDA models. We then combine this novel LDA technique with the
previously developed SMERT model to arrive at our KSMERT model. This method takes
advantage of users’ knowledge in directing data manipulations to achieve more
accurate and meaningful results within a reasonable duration, despite the high computational
loads involved in handling large text data sets. We illustrate the methodology using four small case
study examples and conclude that KSMERT offers desirable accuracy and computational cost
tradeoffs with wide applicability in military and other contexts.

Literature  Review 

Text analytics is the process of analyzing unstructured text, extracting relevant information, and
transforming it into a structured form for further use in the analytic process (Packiam and Prakash,
2015). For example, text analytics could aid in systematic summarization of field reports,
interviews  of  relevant  leaders,  and  insights  from  analysts.  Here,  we  focus  on  one  type  of  text 
analytics  called  topic  modeling  (e.g.,  see  Blei,  Ng,  and  Jordan,  2003).  By  dividing  the  corpus 
into  clusters  or  topics,  these  models  can  clarify  what  is  missing  and  what  is  present  in  the  entire 
corpus.  The  identification  of  the  latent  structure  of  the  corpus  also  allows  topic  models  to  inform 
information  retrieval  reminiscent  of  a  “Google”  search  but  for  unstructured,  unindexed,  and 
untagged  data  sets. 

Blei,  Ng,  and  Jordan  (2003)  propose  the  use  of  mean  field  variational  inference  methods  to 
estimate  the  parameters  of  LDA  based  topic  models.  Teh  (2007)  criticizes  the  accuracy  of  mean 
field  variational  inference  and  argues  that  unbiased  collapsed  Gibbs  sampling  is  comparably 
computationally efficient with improved accuracy. Blei, Ng, and Jordan (2003) also argue that
topic models are qualitatively more relevant than non-generative clustering models, including
K-means. We postulate that they overlook the potential of transforming clustering results to create
generative models, and we propose a K-means-based estimation method for the model parameters as a
computationally faster and reproducible method for fitting LDA models.

Topic model methods that incorporate expert knowledge are a topic of ongoing research. For
example, Zhao, Li, Li, Wang, Ding, and Li (2012) propose two supervised topic models devised
by tracking previous history text data as background knowledge using an interaction matrix. Sun
(2014) proposes a term frequency-inverse document frequency model, which is a keyword model
reflecting how important a word is in a document. However, these models lack semantic
structure, especially for multiple probabilistic distributions over the vocabulary. For this reason,
the LDA method that Blei, Ng, and Jordan (2003) propose is the more widely accepted method for
unsupervised clustering of images or text documents.

In  our  work,  we  focus  on  the  method  of  incorporating  expert  knowledge  proposed  in  Allen, 
Xiong,  and  Afful-Dadzie  (2015).  They  introduce  the  SMERT  model  to  permit  analysts  to 
incorporate  their  domain  specific  knowledge  to  edit  the  topics  while  maintaining  the  LDA  topic 
model structure. Sui, Milam, and Allen (2015) show that SMERT can estimate the proportion
of words in the overall corpus on each topic and incorporate “high-level” inputs from an SME to
adjust the topics by confirming or denying the membership of words in the topic definitions. We
incorporate  our  K-means  clustering  method  for  fitting  LDA  models  into  the  SMERT  model  to 
arrive  at  our  KSMERT  model. 

Review  of  Data  Preparation  and  Topic  Models 

This  section  reviews  the  preparation  of  text  data  for  clustering  and  information  retrieval  and 
describes  the  likelihood  that  defines  both  SMERT  and  LDA.  Estimating  the  parameters  in  the 
likelihood  is  the  objective  in  the  next  section. 

Preparing  Data  for  Text  Modeling 

Feldman  and  Sanger  (2006)  propose  a  generalized  view  of  text  mining  system  architectures 
composed  of  four  main  phases:  preprocessing  tasks,  core  mining  operations,  presentation  layer 
components  and  browsing  functionality,  and  refinement  techniques.  The  K-means  estimation 
technique  we  propose  in  this  research  follows  these  same  four  phases  shown  in  Figure  2.1. 


[Figure 2.1 diagram. Boxes: Text Document; Preprocessing (recognizing tokens, keyword extraction,
making dictionary); Processing Documents (assigning/clustering topics); Analysis (visualization,
topic analysis); Users (users/experts refining topics).]

Figure 2.1. Four phase method for text analysis adapted from Feldman and Sanger (2005).


Phase 1. Preprocessing Tasks include all the preparation of raw data for text mining core
operations. These preparations include cleaning the data source to convert it into a canonical
format. In our methods, this processing includes recognizing tokens or words based on spaces,
trimming words into word roots, making a dictionary, and assigning a corresponding number
index to each word. Here, we use the stemming algorithm from Porter (1980) to trim words.

Phase 2. Core Mining Operations are the core operations in text analytics, including pattern
discovery, trend analysis, and modeling algorithms. In our method, processing documents is the
core part, with pattern discovery through word frequency. The main goal of our algorithm is the
clustering and grouping of topics and words. Here, we fit the LDA and K-means model forms for
mining.

Phase 3. Presentation Layer Components include browsing functionality and visualization
tools. In this phase, the primary visualization tool to view different topics and their contents is
the Pareto chart.

Phase 4. Refinement Techniques include methods to filter information through pruning,
generalizing, or suppressing approaches to achieve discovery optimization. Based on Allen,
Xiong, and Afful-Dadzie (2015), in this phase our method incorporates users’ domain
knowledge and enables analysts to directly supervise and refine the model results while
keeping the model simple.
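
The Phase 1 preprocessing steps (tokenizing on spaces, trimming words to roots, building a dictionary, and assigning an index to each word) can be sketched in a few lines of Python. This is only an illustration and not the VBA code of the tool: `crude_stem` is a hypothetical stand-in that strips a handful of suffixes rather than implementing the full Porter (1980) algorithm, and `build_dictionary` is likewise an illustrative name.

```python
import re


def crude_stem(word):
    """Strip a few common suffixes (a simplified stand-in, NOT the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_dictionary(documents):
    """Tokenize on whitespace, stem, and map each distinct root to an integer index."""
    dictionary = {}   # word root -> index
    encoded = []      # one list of word indices per document
    for text in documents:
        # Lowercase and replace anything that is not a letter or whitespace.
        tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
        indices = []
        for root in (crude_stem(t) for t in tokens):
            if root not in dictionary:
                dictionary[root] = len(dictionary)
            indices.append(dictionary[root])
        encoded.append(indices)
    return dictionary, encoded
```

The encoded documents are the integer word sequences that the later estimation steps operate on; the dictionary size corresponds to the quantity the report calls WC.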

Topic  Model  Notation 

We use a topic model notation that closely follows the notation that Blei, Ng, and Jordan (2003)
propose for their LDA models. In our notation, $w_{d,n}$ is the $n$th word in document $d$, for
$d = 1, \ldots, D$ and $n = 1, \ldots, N_d$. Therefore, $D$ is the number of documents and $N_d$ is
the number of words in the $d$th document. The number of words in the dictionary is denoted $WC$.
The $WC$-dimensional random vector $\phi_t$ represents the probabilities that a randomly selected
word assigned to the topic indexed by $t = 1, \ldots, T$ is each of the words in the dictionary.

The posterior mean of $\phi_t$ defines the topics. The prior parameters $\alpha$ and $\beta$ are usually scalars in
that all documents and all words are initially treated equally. Generally, low values, or “diffuse
priors”, are assigned so that only a small amount of shrinkage is applied and adjustments are
made on a case-by-case basis (Griffiths and Steyvers, 2004). The sampled topic assignments
($z_{d,n}$) permit estimation of the topic definitions ($\phi$). The most common words in each topic are
often the most relevant outcome. The $T$-dimensional random vector $\theta_d$ represents the probability
that a randomly selected word in document $d$ is assigned to each of the $T$ topics or clusters.

Subject  Matter  Expert  Refined  Topic  (SMERT)  and  Latent  Dirichlet 
Allocation  (LDA) 

With the parameter definitions above, the SMERT joint distribution, or likelihood, that defines the
initial SMERT model is simply the product of the individual conditional densities:
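
As a sketch of the shape this product takes, assuming the standard LDA factorization plus one binomial term for each high-level data point $x_{t,c}$ (the conditioning arguments shown are inferred from the surrounding definitions and may not match equation (2.1) term for term):

```latex
P(w, x, z, \theta, \phi \mid \alpha, \beta, N, N_t)
  = \prod_{t=1}^{T} P(\phi_t \mid \beta_t)
    \prod_{d=1}^{D} \Bigl[ P(\theta_d \mid \alpha)
      \prod_{n=1}^{N_d} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid z_{d,n}, \phi) \Bigr]
    \prod_{t=1}^{T} \prod_{c=1}^{WC} P(x_{t,c} \mid N_{t,c}, \phi_{t,c})
```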

where $w$ and $x$ are matrices of the data and $\theta_d$ and $\phi_t$ are vector model parameters to be
estimated. Specifically, $\phi_t$ has elements $\phi_{t,c}$. The vectors $\alpha$ and $\beta_t$ contain prior parameters
that might be assumed to have all their elements equal to the same values, $\alpha_0$ and $\beta_0$. Also,
effective constants include $N$, the vector of document lengths, and $N_t$, the matrix of trial counts.

The  constituent  parts  of  equation  (2.1)  are  the  Dirichlet,  categorical,  and  binomial  densities. 
Collecting  all  the  parts,  equation  (2.1)  becomes: 
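
A sketch of the collected form, assuming Dirichlet priors on $\phi_t$ and $\theta_d$, categorical draws for $z_{d,n}$ and $w_{d,n}$, and a binomial density for each $x_{t,c}$. Here $n_{d,t}$ counts the words in document $d$ assigned to topic $t$ and $n_{t,c}$ counts the occurrences of word $c$ assigned to topic $t$; the arrangement of terms in the report's equation (2.2) may differ:

```latex
\prod_{t=1}^{T} \left[ \frac{\Gamma\!\left(\sum_{c} \beta_{t,c}\right)}{\prod_{c} \Gamma(\beta_{t,c})}
    \prod_{c=1}^{WC} \phi_{t,c}^{\,\beta_{t,c} - 1 + n_{t,c}} \right]
\prod_{d=1}^{D} \left[ \frac{\Gamma\!\left(\sum_{t} \alpha_{t}\right)}{\prod_{t} \Gamma(\alpha_{t})}
    \prod_{t=1}^{T} \theta_{d,t}^{\,\alpha_{t} - 1 + n_{d,t}} \right]
\prod_{t=1}^{T} \prod_{c=1}^{WC} \binom{N_{t,c}}{x_{t,c}}
    \phi_{t,c}^{\,x_{t,c}} \left(1 - \phi_{t,c}\right)^{N_{t,c} - x_{t,c}}
```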

where

\[
n_{d,t} = \sum_{j=1}^{N_d} \sum_{c=1}^{WC} I(z_{d,j} = t \,\&\, w_{d,j} = c),
\qquad
n_{t,c} = \sum_{d=1}^{D} \sum_{j=1}^{N_d} I(z_{d,j} = t \,\&\, w_{d,j} = c).
\tag{2.3}
\]


The left two rectangles in the graphical model representation of Figure 2.2 show the conditional
relationships between the variables in the LDA model. In the figure, the rectangles indicate the
number of elements in each random vector or matrix. For example, the matrices $z$ and $w$ have $N_d$
elements for each of the $D$ documents. LDA is the special case in which the trial counts $N_{t,c}$
equal zero for all $t$ and $c$.

Figure 2.2. Graphical model representation of SMERT.

The posterior mean values for the topic definitions $\phi$ and topic proportions $\theta$ are estimated
through a single replicate of the topic assignments after some level of convergence (Blei, Ng,
and  Jordan,  2003).  These  posterior  means  give  a  conceptual  map  of  the  corpus  because  the 
words  with  highest  probabilities  in  the  topic  definitions  offer  the  most  meaningful  cluster 
definitions.  The  proportions  for  each  topic  summarize  the  corpus  and  prioritize  later 
visualization  parts  (Steyvers  and  Griffiths,  2007). 

Techniques  for  Estimation  of  the  Posterior  Probabilities 

The main computational challenge for topic modeling in LDA is the approximate estimation of
the posterior probabilities. The two previously proposed estimation methods are collapsed Gibbs
sampling (Teh, 2007; Steyvers and Griffiths, 2007) and mean field variational inference (Blei
et al., 2003).


Collapsed  Gibbs  Sampling  Methods 


Allen,  Xiong,  and  Afful-Dadzie  (2015)  fit  the  distribution  in  SMERT  using  collapsed  Gibbs 
sampling  which  is  a  type  of  Markov  Chain  Monte  Carlo  process.  Collapsed  Gibbs  sampling  is 
an  iterative  process  of  modifying  the  topic  assignments  and  distributions.  The  topic  assignments 
converge  to  the  samples  from  the  new  distribution  and  are  then  used  for  estimations  for  the 
topics  and  proportions. 


A major component of collapsed Gibbs sampling is the topic selection $z_{(m,n)}$ for the $n$th word in the
$m$th document. Let $v$ be the word and let the topic assignments for the other words be $z_{-(m,n)}$. Let $q_{i,j,r}$
be the count of the number of samples of topic $i$ in document $j$ for the $r$th word. We use $(\cdot)$ to
denote the sum over counts. Then each sampling is calculated as:

$z_{(m,n)}$ is therefore obtained from a single multinomial draw, and the iteration then moves to the next word.
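
The sampling step can be sketched in Python. This is a minimal illustration that assumes the standard collapsed Gibbs update for plain LDA with symmetric scalar priors `alpha` and `beta`, in which the probability of topic $t$ is proportional to (document-topic count + alpha) × (topic-word count + beta) / (topic total + WC × beta). It omits any additional terms that SMERT's high-level data would contribute, and all function and variable names are illustrative.

```python
import random


def init_state(docs, T, WC, rng):
    """Assign a random topic to every word and build the count tables."""
    z = [[rng.randrange(T) for _ in words] for words in docs]
    n_dt = [[0] * T for _ in docs]        # words in document d assigned to topic t
    n_tc = [[0] * WC for _ in range(T)]   # occurrences of word c assigned to topic t
    n_t = [0] * T                         # total words assigned to topic t
    for d, words in enumerate(docs):
        for n, v in enumerate(words):
            t = z[d][n]
            n_dt[d][t] += 1
            n_tc[t][v] += 1
            n_t[t] += 1
    return z, n_dt, n_tc, n_t


def gibbs_sweep(docs, z, n_dt, n_tc, n_t, alpha, beta, rng):
    """One pass over every word position, resampling its topic assignment."""
    T, WC = len(n_t), len(n_tc[0])
    for d, words in enumerate(docs):
        for n, v in enumerate(words):
            old = z[d][n]
            # Remove the current assignment so the counts cover the other words only.
            n_dt[d][old] -= 1
            n_tc[old][v] -= 1
            n_t[old] -= 1
            weights = [
                (n_dt[d][t] + alpha) * (n_tc[t][v] + beta) / (n_t[t] + WC * beta)
                for t in range(T)
            ]
            # Single multinomial draw for this word, then move to the next word.
            new = rng.choices(range(T), weights=weights)[0]
            z[d][n] = new
            n_dt[d][new] += 1
            n_tc[new][v] += 1
            n_t[new] += 1
```

Because each sweep depends on the random draws, two runs with different seeds generally produce different topic assignments, which is the repeatability concern raised in this chapter.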

Mean  Field  Variational  Inference 

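Blei et al. (2003) propose the variational inference estimation method to approximate an
intractable posterior distribution over hidden variables with a much simpler one with free
variational parameters. In this subsection, $K$ and $V$ play the roles of $T$ and $WC$. Each topic
$\beta_k$ is described by a $V$-dimensional Dirichlet distribution with parameter $\lambda_k$. Each topic
proportion $\theta_d$ is described by a $K$-dimensional Dirichlet distribution with parameter $\gamma_d$. Each
topic assignment $z_{d,n}$ is described by a $K$-dimensional multinomial distribution with parameter
$\phi_{d,n}$. The main iteration is then:

Step 1. For each topic $k$ and word $v$:

\[
\lambda_{k,v} = \eta + \sum_{d=1}^{D} \sum_{n=1}^{N_d} 1(w_{d,n} = v)\, \phi_{d,n,k}
\tag{2.5}
\]

Step 2. For each document $d$:

(a) Update $\gamma_d$:

\[
\gamma_{d,k} = \alpha_k + \sum_{n=1}^{N_d} \phi_{d,n,k}
\tag{2.6}
\]

(b) For each word $n$, update $\phi_{d,n}$:

\[
\phi_{d,n,k} \propto \exp\Bigl\{ \Psi(\gamma_{d,k}) + \Psi(\lambda_{k,w_{d,n}}) - \Psi\Bigl(\sum_{v=1}^{V} \lambda_{k,v}\Bigr) \Bigr\}
\tag{2.7}
\]

where $\Psi$ is the digamma function, the first derivative of the $\log \Gamma$ function. The iterations are
repeated until the minimization of the Kullback-Leibler divergence over the variational parameters
converges.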

K-means  based  Subject  Matter  Expert  Refined  Topic  (KSMERT) 

In practice, not all of the distribution is relevant to users, nor will they find the expression of
all topics as an ordered list of words interpretable. To address this shortcoming of LDA topic
models, Allen, Xiong, and Afful-Dadzie (2015) develop the SMERT method for
probabilistic clustering of texts.

The third rectangle in Figure 2.2 is the additional part necessary for a SMERT model. The two
left-hand-side rectangles are identical to LDA with multinomial response data, $w$. The
right-hand side begins with the arrow from $N$ to $x$, which introduces binomially distributed response
data, $x_{t,c}$ for $t = 1, \ldots, T$ and $c = 1, \ldots, WC$. Here $x_{t,c}$ represents the number of times that
word $c$ is selected in $N_{t,c}$ trials for a given topic $t$. Note that the choice of $N_{t,c}$ in the model is arbitrary. Allen,
Xiong, and Afful-Dadzie (2015) refer to the right-hand-side portion of Figure 2.2 as “hierarchical
analysis designed latency experiments” (HANDLEs) because it permits users to interact with the
model. “Hierarchical analysis” means the use of a hierarchical Bayesian formulation.

“Designed” indicates that the users can incorporate their domain-specific knowledge about the
result to direct further data manipulations. “Latency experiments” indicates that the model has
relatively high leverage on specific latent variables, i.e., $\phi$. Figure 2.3 shows the flow chart of
SMERT. After the initial LDA run, the user can study the results and then apply domain-specific
expertise as the high-level data to arrive at the second-stage result, model 2.

Figure 2.3. Flow chart of SMERT with iterative additions of high-level data.


The primary disadvantage of existing LDA methods is that they require multiple iterative runs,
making them computationally expensive with noisy and unrepeatable results. Our approach, a
topic modeling technique using a K-means estimation method, is capable of achieving much
faster and repeatable results.


For each data point, the distance to each cluster centroid is calculated. The
membership function is then computed as the inverse of the distance. If the distance is zero, the
membership is set to one. The $T$-dimensional membership function vectors $u_1, \ldots, u_D$ can be
interpreted as the probabilities that the data point belongs to the associated clusters, which are the
topics, $\phi_1, \ldots, \phi_T$. Next, the membership functions are scaled using

\[
\alpha_t = \frac{\sum_{d=1}^{D} u_{d,t}}{\sum_{k=1}^{T} \sum_{d=1}^{D} u_{d,k}}
\quad \text{for } t = 1, \ldots, T,
\tag{2.8}
\]

as the topic proportions, which show the distribution of topics across all the documents. The
cluster centroids are represented as $z_1, \ldots, z_T$, and the centroids are scaled using

\[
\phi_{t,c} = \frac{z_{c,t}}{\sum_{c'=1}^{WC} z_{c',t}}
\quad \text{for } t = 1, \ldots, T \text{ and } c = 1, \ldots, WC,
\tag{2.9}
\]

as the topic definitions, which show the distribution of words in the topics.
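
The K-means estimation idea can be sketched in plain Python. Several details here are assumptions for illustration rather than the report's implementation: each document is represented as a row of word counts, distances are Euclidean, the centroids are started deterministically from the first T documents so that repeated runs give identical results, and memberships are computed as the inverse distance to every centroid. The function names are hypothetical.

```python
import math


def kmeans(points, T, iterations=20):
    """Plain Lloyd's algorithm with a deterministic start (the first T points)."""
    centroids = [list(p) for p in points[:T]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(T), key=lambda t: math.dist(p, centroids[t])) for p in points]
        # Move every centroid to the mean of its members.
        for t in range(T):
            members = [p for p, lab in zip(points, labels) if lab == t]
            if members:
                centroids[t] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def kmeans_topic_model(counts, T):
    """counts: D x WC word-count matrix. Returns (topic proportions, topic definitions)."""
    centroids, _ = kmeans(counts, T)
    # Membership: inverse of the distance to each centroid, or one at distance zero.
    u = []
    for row in counts:
        dists = [math.dist(row, c) for c in centroids]
        u.append([1.0 if dist == 0 else 1.0 / dist for dist in dists])
    total = sum(sum(row) for row in u)
    proportions = [sum(row[t] for row in u) / total for t in range(T)]
    # Topic definitions: each centroid rescaled to sum to one over the dictionary.
    definitions = [[value / sum(c) for value in c] for c in centroids]
    return proportions, definitions
```

With a fixed starting rule there is no random seed involved, so the same corpus always yields the same topics, in contrast to the sampling-based estimate.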

Numerical  Studies 

This  section  describes  the  four  test  problems  and  two  evaluation  metrics  used  to  evaluate  the 
accuracy  of  the  K-means  topic  modeling  methodology.  Additionally,  the  results  from  the 
comparisons  and  related  findings  follow. 

Test  Problems 

Four similar test cases allow comparison of the different estimation methods. Table 2.1
summarizes the computational results for the timing. Appendix C contains an example test
problem and the corresponding true model topic distribution. All test problems in this
research use corpora of 40 documents ($D = 40$) and a dictionary size of 25 words ($WC = 25$).

Evaluation Metrics


The estimated distribution for topics has no natural ordering, so it is hard to compare the results
against the assumed ground truths. Therefore, Steyvers and Griffiths (2007) propose the
evaluation of each permutation of the cluster labels before selecting the permutation with the
closest distance. Define the function $t'(r, t)$ as the selection of topic $t$ in permutation $r$, where
$r$ ranges over the set $S$ of permutations. Use $\phi^{true}_{t,c}$ to denote the true topic definitions and
$\alpha^{true}_t$ to denote the true topic proportions for $t = 1, \ldots, T$ and for $c = 1, \ldots, WC$.

The minimum average Kullback-Leibler divergence (KLD) for the topic definitions is:

\[
KLD(\phi) = \min_{r \in S} \frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{WC}
  \phi^{true}_{t,c} \ln\!\left( \frac{\phi^{true}_{t,c}}{\phi_{t'(r,t),c}} \right).
\tag{2.10}
\]

Further, denote $r^*$ as the minimizing permutation for equation (2.10). Another measure of distance
is the average root mean squared (RMS) error:

\[
RMS(\phi) = \frac{1}{T} \sum_{t=1}^{T}
  \sqrt{ \frac{1}{WC} \sum_{c=1}^{WC} \left( \phi^{true}_{t,c} - \phi_{t'(r^*,t),c} \right)^2 }.
\tag{2.11}
\]
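
The permutation matching for the topic definitions can be sketched as follows. This is an illustration under stated assumptions: the averaging factors (one over the number of topics, one over the dictionary size) follow the "average" and "root mean squared" wording, zero estimated probabilities are floored at a small constant `EPS` to keep the logarithm finite, and the function names are hypothetical. Enumerating every permutation is only practical for a small number of topics.

```python
import math
from itertools import permutations

EPS = 1e-12  # assumed floor for estimated probabilities, to avoid division by zero


def kld_phi(phi_true, phi_est, perm):
    """Average KL divergence between true topics and estimated topics under `perm`."""
    T = len(phi_true)
    total = 0.0
    for t in range(T):
        for p, q in zip(phi_true[t], phi_est[perm[t]]):
            if p > 0:
                total += p * math.log(p / max(q, EPS))
    return total / T


def best_permutation(phi_true, phi_est):
    """Permutation of estimated topic labels that minimizes the average KLD."""
    T = len(phi_true)
    return min(permutations(range(T)), key=lambda r: kld_phi(phi_true, phi_est, r))


def rms_phi(phi_true, phi_est, perm):
    """Average over topics of the root mean squared difference over words."""
    T, WC = len(phi_true), len(phi_true[0])
    return sum(
        math.sqrt(sum((p - q) ** 2 for p, q in zip(phi_true[t], phi_est[perm[t]])) / WC)
        for t in range(T)
    ) / T
```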

The accuracy measures for the topic proportions are thus:

\[
KLD(\alpha) = \sum_{t=1}^{T} \alpha^{true}_t \ln\!\left( \frac{\alpha^{true}_t}{\alpha_{t'(r^*,t)}} \right)
\tag{2.12}
\]

and

\[
RMS(\alpha) = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \left( \alpha^{true}_t - \alpha_{t'(r^*,t)} \right)^2 }.
\tag{2.13}
\]

Comparison of Results

The next two sections show how we compare the performance of K-means LDA, as the initial
step to KSMERT, to the existing methodologies. The two measures of performance for this
comparison are model accuracy and computational time.

Model Accuracy

Figure 2.4 displays the results of K-means LDA, Gibbs sampling LDA with 10, 100, and 1000
iterations, and variational inference LDA against the true models for all four cases. Using the
RMS metric, K-means LDA achieves a distance to the true model that is similar to or smaller than
that of the other methods. For Gibbs sampling, Monte Carlo simulation
introduces uncertainties. A higher number of iterations produces a slightly better RMS than
lower numbers, but the trend is highly influenced by the random seed.

Figure 2.5 shows the comparison of the various LDA and associated SMERT/KSMERT
implementations for Case 4 only. The accuracy results for all of the SMERT/KSMERT
implementations are similar because the user employs domain knowledge to direct data
manipulations.

Figure 2.4. RMS comparison for different estimation methods for LDA only.

Figure 2.5. RMS comparison for different estimation methods for LDA and SMERT (Case 4).


Computational  Time 

The running time in minutes for collapsed Gibbs sampling LDA is roughly predicted using:

\[
\text{Run time (collapsed Gibbs)} =
\frac{-1899.1766 + 0.2736 \cdot \text{numberDocuments} + 17.9817 \cdot \text{numberFields}
      + 38.5079 \cdot \text{TopicNumber.Value} + 0.9342 \cdot \text{MaxNumberIteration.Value}}{60}
\tag{2.14}
\]

For K-means-based estimation, the run time grows as:

\[
\text{Run time (K-means-based)} =
\frac{-391.4993 + 0.1001 \cdot \text{numberDocuments} + 11.1628 \cdot \text{numberFields}
      + 9.3168 \cdot \text{TopicNumber.Value} + 0.58986 \cdot \text{MaxNumberIteration.Value}}{60}
\tag{2.15}
\]
As the number of documents grows, the runtime of the collapsed Gibbs sampling estimation method grows
at a rate more than twice that of the K-means estimation method. K-means LDA achieves
these reductions in runtime while remaining about as close to the true models as the other methods.
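
The two fitted run time expressions are easy to evaluate side by side. The sketch below simply transcribes the coefficients of equations (2.14) and (2.15); the function and argument names are illustrative. Because both intercepts are negative, the predictions are only meaningful for inputs large enough to give a positive result, presumably within the range of settings used to fit them.

```python
def predicted_minutes_gibbs(n_docs, n_fields, n_topics, max_iter):
    """Predicted run time in minutes for collapsed Gibbs sampling, equation (2.14)."""
    raw = (-1899.1766 + 0.2736 * n_docs + 17.9817 * n_fields
           + 38.5079 * n_topics + 0.9342 * max_iter)
    return raw / 60


def predicted_minutes_kmeans(n_docs, n_fields, n_topics, max_iter):
    """Predicted run time in minutes for K-means-based estimation, equation (2.15)."""
    raw = (-391.4993 + 0.1001 * n_docs + 11.1628 * n_fields
           + 9.3168 * n_topics + 0.58986 * max_iter)
    return raw / 60


# Example: 10,000 documents, 2 text fields, 10 topics, 1,000 iterations.
gibbs = predicted_minutes_gibbs(10_000, 2, 10, 1_000)
kmeans_based = predicted_minutes_kmeans(10_000, 2, 10, 1_000)
```

The per-document coefficients, 0.2736 versus 0.1001, are the source of the statement that the Gibbs run time grows more than twice as fast as the corpus grows.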

Chapter 3. KSMERT Software Development


Overview 

In addition to proposing the K-means estimation method and KSMERT models, we develop a
corresponding software instantiation of the methodology that is easily deployable to analysts
operating in a DoD computing environment. Since the Microsoft Office Suite, including Excel,
is ubiquitous across the DoD computing environment, the Visual Basic for Applications (VBA)
programming language presents the best overall value for development of the KSMERT software
instantiation. We acknowledge that VBA has some drawbacks, including slow runtimes and a
lack of external libraries, but believe its ability to support easy deployment to the DoD analytic
workforce outweighs these drawbacks.

Developmental  Concept 

The  four  lines  of  effort  in  the  development  of  the  KSMERT  software  are: 

•  Core subroutines, which transform text to numbers and create clusters and word
assignments,

•  Human-computer interaction, which includes data visualizations and knowledge
elicitation from the SMEs,

•  Method and code testing, which are built-in ways to evaluate the core subroutines for
verification and validation (the outcome of our quality assurance plan), and

•  Code sharing methods, which include the ability to install the code as an add-in similar
(in some respects) to the Excel Solver.

Figure 3.1 shows how the key subroutines and features of the software development fit into the
four areas of effort. The items highlighted with thick borders are those specifically
developed over the course of this project, while the other items represent those carried forward
from previous efforts by the research team.

Figure 3.1. Categories and modules in KSMERT with thick borders indicating innovations.


Details  Related  to  the  Code  Additions 

This  section  contains  additional  details  for  each  of  the  modules  of  the  software  we  develop  to 
provide  a  deployable  software  instantiation  of  the  KSMERT  topic  modeling  methodology. 

Fitting  K-means  LDA  and  KSMERT. 

In our initial evaluations, the most significant usability issue was speed. The initial version
of the SMERT software, using collapsed Gibbs sampling based LDA models, took too long to fit corpora
of interest to the user population. Our sponsor’s user population requires a tool capable of
producing topic models from corpora containing thousands of documents in less than half a
minute. In response to this need, we develop the K-means-based estimation method described in
Chapter 2. The software incorporates this methodology as an option to fit both the initial LDA
models and the SMERT models.

Run  Testing. 

It is critical for quality assurance to test the core subroutines using examples with known inputs
and outputs. It is desirable to be able to run these tests at any time, particularly after major code
changes, to ensure that the code has not regressed. To address this requirement, the code
includes built-in testing features. The user can “Unhide” the test cases worksheet and click “Run
Tests”. The resulting outputs show the user, across four test cases, that the KSMERT method
offers accuracy comparable to collapsed Gibbs sampling and variational inference methods.
Allen, Xiong, and Afful-Dadzie (2015) further describe the development and use of the test
problems and metrics.
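
A regression check of this kind can be sketched as follows. This Python example is illustrative only: the function names, the tolerance, and the small matrices are assumptions, not the contents of the test cases worksheet. It compares an estimated topic-word matrix with a known true matrix using root mean square (RMS) error, trying every topic ordering because topic labels are arbitrary.

```python
import math
from itertools import permutations

def rms_error(estimated, true):
    """RMS difference between two equally sized matrices (lists of rows)."""
    diffs = [(e - t) ** 2
             for est_row, true_row in zip(estimated, true)
             for e, t in zip(est_row, true_row)]
    return math.sqrt(sum(diffs) / len(diffs))

def best_rms(estimated, true):
    """Smallest RMS over all topic orderings (feasible only for a few topics)."""
    return min(rms_error([estimated[i] for i in perm], true)
               for perm in permutations(range(len(estimated))))

def run_test(name, estimated, true, tolerance=0.05):
    """Return a small pass/fail record for one test case."""
    error = best_rms(estimated, true)
    return {"test": name, "rms": error, "passed": error <= tolerance}

# Hypothetical known model and an estimate with its topics in swapped order.
true_model = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
estimate = [[0.02, 0.0, 0.48, 0.5],
            [0.5, 0.48, 0.02, 0.0]]
result = run_test("two-topic case", estimate, true_model)
print(result["passed"])  # True
```

Rerunning such checks after each major code change flags a regression whenever the error for a known case rises above the chosen tolerance.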

Retrieving  Top  Documents. 

During beta software testing, our BMC user group at Fort Bliss, Texas, requested an
additional application of the topic model. The users need to fit the topic model, edit the
definitions, and then retrieve the documents most relevant to a specific topic on demand. This
new application for topic model methodologies led to the development of an information
retrieval feature. The user simply specifies the topic number and the desired number of top
documents. The feature then retrieves the documents (often simply rows of the database) which
have the highest proportion of their words estimated to be on the relevant topic.
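
The retrieval step amounts to sorting documents by their estimated proportion on one topic. A minimal Python sketch follows, assuming the fitted model is available as a table with one row per document and one column per topic; the function name, the zero-based topic index, and the numbers are hypothetical.

```python
def top_documents(doc_topic, topic, n):
    """Return (row_index, proportion) pairs for the n documents with the
    largest estimated share of words on the given topic."""
    ranked = sorted(enumerate(row[topic] for row in doc_topic),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical document-topic proportions: one row per document.
doc_topic = [
    [0.10, 0.90],
    [0.75, 0.25],
    [0.40, 0.60],
    [0.95, 0.05],
]
print(top_documents(doc_topic, topic=0, n=2))  # [(3, 0.95), (1, 0.75)]
```

The returned row indices can then be used to pull the original text of each document from the data sheet.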

Generating  Period  Date  Chart. 

The original SMERT software divides the corpus into 10 parts, effectively assuming a time-based
ordering of the base data. This allows the user to identify changes in the topic with the highest
proportion across fixed buckets. Our current KSMERT software instantiation now allows the
user to specify time intervals such as day, month, or year if the data contains a date/time field.
This allows users to identify changes in the top topic across time domains containing different
quantities of documents.
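
The grouping by user-selected interval can be sketched in Python as follows. This is an illustration under stated assumptions, not the KSMERT code: it assumes each document has a date and a row of topic proportions, and it assumes the top topic of a period is the one with the highest average proportion in that period. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Period labels for each supported interval.
FORMATS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}

def top_topic_by_period(dates, doc_topic, interval="month"):
    """Average the document-topic proportions within each period and
    return the index of the top topic per period."""
    buckets = defaultdict(list)
    for d, row in zip(dates, doc_topic):
        buckets[d.strftime(FORMATS[interval])].append(row)
    result = {}
    for period in sorted(buckets):
        rows = buckets[period]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        result[period] = max(range(len(means)), key=means.__getitem__)
    return result

# Hypothetical dated documents with two topics.
dates = [date(2016, 1, 5), date(2016, 1, 20), date(2016, 2, 3)]
doc_topic = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
print(top_topic_by_period(dates, doc_topic))  # {'2016-01': 0, '2016-02': 1}
```

Unlike ten fixed buckets, the periods here may contain different numbers of documents, which is why the sketch averages within each period rather than summing.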

Interacting  with  Forms. 

A key feature of the software development for this project period is a graphical user interface
including user forms to elicit directions for data fitting and table creation. Additionally, we add
icons to the Excel ribbon to allow users to run KSMERT in any worksheet without needing to copy
the data into a specific sheet. The user interface greatly improves the usability and
professionalism of the software.

Including  an  Add-In  Installer. 

The  user  has  two  options.  First,  the  user  can  copy  the  data  into  the  KSMERT  workbook  and 
perform  analysis.  Alternatively,  the  user  can  apply  KSMERT  as  an  Add-In.  After  loading  the 
path  information  for  the  KSMERT  file,  the  user  can  then  use  KSMERT  inside  any  Excel 
workbook.  These  two  options  increase  the  flexibility  of  the  software  and  may  aid  its  usefulness 
in  classified  environments.  Appendix  D  contains  detailed  instructions  for  the  installation  of  the 
add-in. 


Chapter 4. Summary and Conclusions


We  propose  an  innovative  estimation  technique  using  K-means  clustering  to  fit  LDA  topic 
models.  We  also  integrate  our  K-means  clustering  technique  with  the  original  SMERT  model 
methodology  to  produce  KSMERT  models.  We  demonstrate  through  test  problems  that 
KSMERT  can  achieve  improved  repeatability  and  comparable  subjective  accuracy.  Specifically, 
we  use  four  cases  to  test  our  new  model  against  the  true  models.  The  improved  efficiency  is 
important  for  enabling  spreadsheet  applications  or  the  use  of  topic  modeling  techniques  on  large 
data  sets. 

A number of areas for improvement in fitting topic models remain for future study. Other
techniques besides K-means based estimation, such as fuzzy C-means clustering, deserve further
research. In addition, further comparison metrics and test cases might better clarify the
accuracy limitations of KSMERT methods. New evaluation metrics could be more objective and
interpretable than RMS. Currently, the running time experiments involve only small test corpora
from Allen, Xiong, and Afful-Dadzie (2015). Larger corpora from the literature may serve as more
representative test cases. The development of additional visualization methods beyond Pareto charts
may increase the interpretability of results for the analyst and customers.

We believe that KSMERT is a valuable software tool that gives analysts without a coding
capability or access to other analytic tools the ability to conduct accurate and reproducible
analysis on large free text data sets. Additionally, we believe that the development of the
K-means estimation technique for LDA models is a significant advancement in the topic modeling
field. At the same time, we see opportunities for further improvements. Below are our top five
potential directions for further improvement.

1.  Further improvements in computational speed are possible by simply coding the core
operations in C++. This likely would require the creation of two files instead of one,
making transportability more difficult, but it may yield a 20x or 50x speed increase.

2.  The  comparison  of  alternative  estimation  methods  deserves  expansion  beyond  the  4 
small,  arbitrarily  generated  cases  from  Chapter  2.  A  more  thorough  comparison  should 
make  use  of  test  problems  from  other  literature  besides  Allen,  Xiong,  and  Afful-Dadzie 
(2015).  Additionally,  ground  truth  topic  models  could  generate  corpora  with  the  different 
estimation  methods  measured  against  their  ability  to  reconstitute  the  ground  truth  models. 

3.  The  development  of  additional  visualizations  can  increase  the  interpretability  of  the 
information  for  both  the  analyst  and  the  supported  decision  maker.  Plotting  the  top  topic 
in  the  time  series  is  useful  but  methods  to  visualize  all  the  topics  over  time  that  are 
present  in  other  software  and  articles  could  be  included.  New  visualizations,  designed  to 
facilitate  the  focus  on  specific  contrasts  relevant  to  specific  issues  or  technologies, 
deserve  further  research  and  development. 

4.  Further improvement to the accuracy of fast surrogate estimation techniques, like
K-means, is likely possible from a careful study of the SMERT and LDA likelihoods to
fashion an improved surrogate. Improvements in computational efficiency and reductions
in bias in estimation are likely possible. KSMERT is a promising start but additional
research is possible.

5.  Porting KSMERT to a coding language used by the wider analytic community. The
use of VBA as the language that best supports software deployment only makes sense in
the restricted environs of the DoD computing abyss. By coding KSMERT, ideally as a
deployable package, in a language such as R or Python we significantly expand the
power of the methodology. Additionally, this allows wider populations from the analytic
and academic communities to use, review, and improve on our research.
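
The ground-truth comparison suggested in item 2 could begin with a corpus generator along the following lines. This Python sketch assumes the standard LDA generative process (a Dirichlet topic mixture per document, then a topic and a word drawn for each position). The function names, parameter values, and the tiny model are hypothetical and are not the true model of Appendix C.

```python
import random

def sample_dirichlet(rng, alpha, k):
    """Draw a symmetric Dirichlet(alpha) vector by normalizing gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(topic_word, vocab, n_docs, doc_len, alpha=0.5, seed=0):
    """Generate documents from a known topic model (rows of topic_word are
    word distributions over vocab)."""
    rng = random.Random(seed)  # fixed seed so test corpora are reproducible
    k = len(topic_word)
    corpus = []
    for _ in range(n_docs):
        theta = sample_dirichlet(rng, alpha, k)  # document's topic mixture
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(k), weights=theta)[0]  # topic for this word
            words.append(rng.choices(vocab, weights=topic_word[z])[0])
        corpus.append(" ".join(words))
    return corpus

# Hypothetical two-topic ground truth over a four-word vocabulary.
vocab = ["cut", "dropped", "drilled", "overheated"]
topic_word = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
corpus = generate_corpus(topic_word, vocab, n_docs=5, doc_len=6, seed=1)
```

Each estimation method would then be fit to the generated corpus and scored on how closely its estimated topic-word matrix matches `topic_word`.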


Appendix B - Glossary

BMC       Brigade Modernization Command
DoD       Department of Defense
KSMERT    K-means Subject Matter Expert Refined Topic
LDA       Latent Dirichlet Allocation
NIE       Network Integration Exercise
SME       Subject Matter Expert
SMERT     Subject Matter Expert Refined Topic
TRAC      TRADOC Analysis Center
WSMR      White Sands Missile Range
VBA       Visual Basic for Applications


Appendix C - Numerical Examples for KSMERT


This  appendix  contains  data  for  the  case  studies  including  the  true  model  that  originally  appeared 
in  Allen,  Xiong,  and  Afful-Dadzie  (2015). 


Table C-1. Synthetic data for the numerical example.


Doc#  Document
1     The operator cut aluminum and dropped it at station1.
2     The inspector drilled plastic and overheated it at station2.
3     The manager milled steel and misaligned it at station3.
4     The engineer saw stone and over torqued on the truck.
5     The supplier welded and misdimensioned the titanium offsite.
6     The inspector drilled plastic and overheated it at station2.
7     It was drilled and overheated.
8     It was drilled and overheated.
9     The engineer and the manager at station3 and on the truck.
10    The welded titanium was misdimensioned.
11    The titanium was welded and misdimensioned offsite.
12    The steel was misdimensioned.
13    The operator cut the steel and plastic.
14    The manager welded it and misdimensioned it.
15    The operator cut and dropped the aluminum at station1.
16    The operator cut and dropped it at station1.
17    The engineer welded and misdimensioned the titanium.
18    It was drilled and overheated.
19    It was drilled and overheated.
20    The manager milled steel and misaligned it at station3.
21    The operator cut and dropped the steel at station1.
22    The engineer and the manager at station3 and offsite.
23    It was drilled and overheated.
24    The engineer saw stone and over torqued on the truck.
25    The stone was drilled and overheated.
26    It was drilled and overheated.
27    It was drilled and overheated.
28    It was drilled and overheated offsite.
29    The supplier welded titanium and misdimensioned it offsite.
30    The operator cut and dropped the titanium at station1.
31    The operator cut and dropped it at station1.
32    It was steel.
33    The steel was drilled and overheated.
34    It was drilled and overheated at station3.
35    The engineer and the manager at station1 and on the truck.
36    The welded titanium was misdimensioned.
37    It was drilled and overheated.
38    It was drilled and overheated.
39    The supplier welded titanium and misdimensioned it offsite.
40    It was drilled and overheated.

Table C-2. True model for the numerical example.


Appendix D - SMERT Users Guide


Instructions  for  Installing  KSMERT  as  an  Add-In 

It is important to note that KSMERT is most easily used as a spreadsheet, loading data into it and
obtaining results. However, frequent users might prefer to have KSMERT loaded into their
environments so that they can use KSMERT in any sheet as an add-in similar (perhaps) to the
Excel Solver. The following are steps for using KSMERT as an add-in.

Step 1. Download the software “KSMERT_DLL_v....xlsm” (current version) and save it on the
desktop or in any designated folder. Rename the file to “SMERT.xlsm” (optional). Figure D-1
illustrates saving the file to a SMERT folder on the desktop.


Figure D-1. A directory with SMERT (or KSMERT).


Step 2. Open this file. Click File -> Save As. Then choose “Excel Add-in” as the “Save as type”.
The add-in is then automatically saved to the user's add-ins directory, in this example
C:\Users\sui\AppData\Roaming\Microsoft\AddIns. This step installs the KSMERT add-in
module in Excel. See Figure D-2 and Figure D-3 for illustrations of these steps.

Figure D-2. Depiction of the “Save As” feature in Excel.

Figure D-3. Illustration of the procedure for generating an add-in.

Step 3. Open a previously saved Excel file that you want to analyze. Then go to File -> Options.
At the very bottom, click the “Go...” button. The add-in “Smert” should automatically be listed. If
not, click “Browse...” to select it. Check the “Smert” add-in and click “OK”. The
KSMERT add-in is now available to use. See Figure D-4, Figure D-5, and Figure D-6.

Figure D-4. The selection of options to include an add-in.

Figure D-5. Illustration showing where KSMERT can be selected before “Go...”.

Figure D-6. The last selection for the SMERT (or KSMERT) add-in.


Step 4. The KSMERT button set should now be available on your tool bar. To carry out your
analysis, click “Load Worksheets” to load the SMERT worksheets. For the Directory Path, enter
the folder from Step 1, in this example “C:\Users\sui\Desktop\SMERT\”. Be sure to include “\” at
the very end of the directory name. Then for File Name, input the file name, which was set in
Step 1 as “SMERT.xlsm”. Then click the “Load Worksheets” button.
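
The trailing “\” would matter if, as these instructions suggest, the add-in builds the full file location by joining the Directory Path and the File Name as plain text. The following Python sketch illustrates that assumption only; it is not the add-in's actual code, and both helper functions are hypothetical.

```python
def full_path(directory, file_name):
    """Join directory and file name by plain text concatenation (assumed behavior)."""
    return directory + file_name

def with_trailing_backslash(directory):
    """Append a backslash if the directory name does not already end with one."""
    return directory if directory.endswith("\\") else directory + "\\"

folder = "C:\\Users\\sui\\Desktop\\SMERT"

# Without the trailing backslash the folder and file names run together.
print(full_path(folder, "SMERT.xlsm"))
# C:\Users\sui\Desktop\SMERTSMERT.xlsm

# With it, the result is the intended file location.
print(full_path(with_trailing_backslash(folder), "SMERT.xlsm"))
# C:\Users\sui\Desktop\SMERT\SMERT.xlsm
```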

Figure D-7. The workbook showing where the worksheets can be loaded.

After these steps, the program is ready and analysis is possible using KSMERT in different
workbooks as indicated in Figure D-9.

Figure D-9. The final add-in version in operation.


Instructions  for  Using  SMERT  and  KSMERT 

After  opening  an  Excel  workbook  with  KSMERT  in  it  or  loading  the  add-ins,  there  is  a  SMERT 
item  in  the  ribbon.  Select  Run  LDA/SMERT  from  the  SMERT  ribbon  or,  alternatively,  click  on 
the  Run  SMERT  button  in  the  Start  Sheet.  The  dialog  appears  as  pictured  in  Figure  D-10.  Use 
the  cursor  to  click  on  “Select  Text”  and  enter  the  range  of  cells  with  the  text.  If  the  data  has  a 
header  row,  click  on  “My  data  has  headers”  in  the  dialog.  Then  select  the  number  of  “topics”  or 
clusters. 

Often, using 10 topics is a reasonable starting point. For preliminary results, set the maximum
number of iterations to 5. For defensible results, select 30; convergence will often occur
automatically before 30 is reached. “K-means” is the innovative estimation method described in
Chapter 2. It is much faster than collapsed Gibbs sampling. Click on “Run LDA” since SMERT is
not available until after LDA is run.

Figure D-10. The basic LDA or SMERT dialog in which data are entered.


After LDA is run, each cluster is represented by a list of the words with the highest posterior
mean probability for the given topic (Figure D-11). The user can then “boost” to
affirm or “zap” to remove any of the top words, as shown in the figure. In this case, expert
judgement suggests that prc117 does not relate to topic one, which is about radios and report.
After “editing” the topic definitions, rerun by either selecting “Run LDA SMERT” or by clicking
“Run SMERT” as before. Click Yes... and Yes.... These prompts check that LDA has been run,
since SMERT cannot be applied otherwise.
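
The top-word lists that the user edits can be thought of as a simple ranking of words by probability within each topic. The Python sketch below shows only that ranking step. The function name, vocabulary, and probabilities are hypothetical, and the sketch does not attempt to reproduce how SMERT incorporates boosts and zaps.

```python
def top_words(topic_word, vocab, n=3):
    """List the n highest-probability words for each topic."""
    result = []
    for probs in topic_word:
        ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
        result.append([word for word, _ in ranked[:n]])
    return result

# Hypothetical posterior mean word probabilities for two topics.
vocab = ["radio", "report", "prc117", "antenna"]
topic_word = [[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]]
print(top_words(topic_word, vocab, n=2))
# [['radio', 'report'], ['antenna', 'prc117']]
```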

Figure D-11. Spreadsheet with boosts and zaps to edit the topic definitions.


After running both the LDA and SMERT steps, the Pareto and time series visualizations are
available. The time series are plots of the top topic definitions by posterior mean probability
(approximately) for periods of selected durations (days, months, years, ...). Also, either clicking
on “Make Top Document Table” or selecting the “topicTopDocumentTable” worksheet permits
retrieval of the top documents by posterior probability. The results are similar to those of
a standard search engine, as documents on topics are ordered and provided as shown in Figure
D-12. At this point, the clusters have been edited so that they are relevant, and the top
documents on any topic can be retrieved for inspection.

Figure D-12. The document retrieval table for a case study example.